This dataset collects information from about 110,000 medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.
- What causes patients not to show up for their appointments?
- What is the relationship between a patient's disease and not showing up?
- ScheduledDay tells us on what day the patient set up their appointment.
- Neighbourhood indicates the location of the hospital.
- Scholarship indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sas
import plotly.express as px
df = pd.read_csv("dataset/appointments.csv")
df.head()
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
df.shape
(110527, 14)
df.columns
Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64
 2   Gender          110527 non-null  object
 3   ScheduledDay    110527 non-null  object
 4   AppointmentDay  110527 non-null  object
 5   Age             110527 non-null  int64
 6   Neighbourhood   110527 non-null  object
 7   Scholarship     110527 non-null  int64
 8   Hipertension    110527 non-null  int64
 9   Diabetes        110527 non-null  int64
 10  Alcoholism      110527 non-null  int64
 11  Handcap         110527 non-null  int64
 12  SMS_received    110527 non-null  int64
 13  No-show         110527 non-null  object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
- ScheduledDay is stored as object (string), not datetime
- AppointmentDay is stored as object (string), not datetime
- No-show is stored as object (the strings Yes/No)
- There are no missing values
df.isnull().sum().sum()
0
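The date columns could also be parsed when the file is read, instead of being converted afterwards. The following is a minimal sketch on a two-row in-memory sample (values copied from the preview above); `sample` and `csv_text` are names introduced only for this illustration.

```python
import io
import pandas as pd

# Two rows in the same layout as the real file, reduced to three columns.
csv_text = """ScheduledDay,AppointmentDay,No-show
2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,No
2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,No
"""

# parse_dates asks read_csv to convert the listed columns to datetimes.
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["ScheduledDay", "AppointmentDay"],
)

# The two date columns should now be datetime-like rather than object.
print(sample.dtypes)

# Total number of missing cells across the whole frame.
missing_cells = int(sample.isnull().sum().sum())
print(missing_cells)
```

With the real file the same `parse_dates` argument would be passed to the `pd.read_csv("dataset/appointments.csv")` call.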
df.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PatientId | 110527.0 | 1.474963e+14 | 2.560949e+14 | 3.921784e+04 | 4.172614e+12 | 3.173184e+13 | 9.439172e+13 | 9.999816e+14 |
| AppointmentID | 110527.0 | 5.675305e+06 | 7.129575e+04 | 5.030230e+06 | 5.640286e+06 | 5.680573e+06 | 5.725524e+06 | 5.790484e+06 |
| Age | 110527.0 | 3.708887e+01 | 2.311020e+01 | -1.000000e+00 | 1.800000e+01 | 3.700000e+01 | 5.500000e+01 | 1.150000e+02 |
| Scholarship | 110527.0 | 9.826558e-02 | 2.976748e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 |
| Hipertension | 110527.0 | 1.972459e-01 | 3.979213e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 |
| Diabetes | 110527.0 | 7.186479e-02 | 2.582651e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 |
| Alcoholism | 110527.0 | 3.039981e-02 | 1.716856e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 |
| Handcap | 110527.0 | 2.224796e-02 | 1.615427e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 |
| SMS_received | 110527.0 | 3.210256e-01 | 4.668727e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 |
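# numeric_only=True skips the object columns; recent pandas versions raise an error without it
df_corr=df.corr(numeric_only=True)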
df_corr.style.background_gradient(cmap='coolwarm', axis=None)
| | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| PatientId | 1.000000 | 0.004039 | -0.004139 | -0.002880 | -0.006441 | 0.001605 | 0.011011 | -0.007916 | -0.009749 |
| AppointmentID | 0.004039 | 1.000000 | -0.019126 | 0.022615 | 0.012752 | 0.022628 | 0.032944 | 0.014106 | -0.256618 |
| Age | -0.004139 | -0.019126 | 1.000000 | -0.092457 | 0.504586 | 0.292391 | 0.095811 | 0.078033 | 0.012643 |
| Scholarship | -0.002880 | 0.022615 | -0.092457 | 1.000000 | -0.019729 | -0.024894 | 0.035022 | -0.008586 | 0.001194 |
| Hipertension | -0.006441 | 0.012752 | 0.504586 | -0.019729 | 1.000000 | 0.433086 | 0.087971 | 0.080083 | -0.006267 |
| Diabetes | 0.001605 | 0.022628 | 0.292391 | -0.024894 | 0.433086 | 1.000000 | 0.018474 | 0.057530 | -0.014550 |
| Alcoholism | 0.011011 | 0.032944 | 0.095811 | 0.035022 | 0.087971 | 0.018474 | 1.000000 | 0.004648 | -0.026147 |
| Handcap | -0.007916 | 0.014106 | 0.078033 | -0.008586 | 0.080083 | 0.057530 | 0.004648 | 1.000000 | -0.024161 |
| SMS_received | -0.009749 | -0.256618 | 0.012643 | 0.001194 | -0.006267 | -0.014550 | -0.026147 | -0.024161 | 1.000000 |
df.nunique()
PatientId          62299
AppointmentID     110527
Gender                 2
ScheduledDay      103549
AppointmentDay        27
Age                  104
Neighbourhood         81
Scholarship            2
Hipertension           2
Diabetes               2
Alcoholism             2
Handcap                5
SMS_received           2
No-show                2
dtype: int64
px.box(df,y='Age',title='Outliers in the Age column')
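f,ax=plt.subplots(figsize=(8,8))
# correlate numeric columns only; recent pandas versions raise an error on object columns
df_corr=df.corr(numeric_only=True)
sas.heatmap(df_corr,annot=True,ax=ax)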
df.hist(figsize=(10,10));
sas.pairplot(df,diag_kind='kde')
df['Age'].min()
-1
df.loc[df['Age']<0]
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99832 | 4.659432e+14 | 5775010 | F | 2016-06-06T08:58:13Z | 2016-06-06T00:00:00Z | -1 | ROMÃO | 0 | 0 | 0 | 0 | 0 | 0 | No |
sas.set()
f,ax=plt.subplots()
ax.hist(df['Age'])
plt.title('Distribution of the Age column',fontsize=20)
plt.ylabel('Number of patients',fontsize=12)
plt.xlabel('Age')
plt.show()
- The mean age is about 37 years (the median is also 37)
- The oldest patient is 115 years old
- Few appointments involve a patient with a handicap (the mean Handcap value is about 0.02)
- An SMS was received for about 32% of appointments, i.e. fewer than half
- About 3% of appointments involve a patient with alcoholism
- Chronic diseases are also in the minority: about 20% of appointments for hypertension and about 7% for diabetes
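The percentages in the list above can be read off the `mean` column of `describe()`, because the mean of a 0/1 column is the share of rows equal to 1. A minimal sketch on made-up values (the `toy` frame is illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Alcoholism":   [0, 0, 0, 1],
    "SMS_received": [1, 0, 0, 1],
    "Handcap":      [0, 0, 2, 0],
})

# For a 0/1 column the mean is the share of rows equal to 1.
binary_cols = ["Alcoholism", "SMS_received"]
share_pct = toy[binary_cols].mean() * 100

# Handcap is not binary (it ranges from 0 to 4 in the real data),
# so count every non-zero level instead of taking the mean.
handicap_pct = (toy["Handcap"] > 0).mean() * 100

print(share_pct)
print(handicap_pct)
```

Treating Handcap separately matters because its mean mixes the number of rows with the level recorded in each row.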
df.rename(columns=lambda x: x.lower().replace('-','_'),inplace=True)
df.head()
| | patientid | appointmentid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
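The lambda-based rename only lowercases the names and replaces hyphens, so the misspellings in the source file (hipertension, handcap) survive. If corrected names were wanted, an explicit mapping could be passed to `rename`. This is only a sketch: the corrected names are hypothetical, and the rest of this notebook keeps the original spellings.

```python
import pandas as pd

# Empty frame with three of the renamed columns, enough to show the mapping.
toy = pd.DataFrame(columns=["hipertension", "handcap", "no_show"])

# Hypothetical corrected names; columns missing from the mapping are left unchanged.
fixed = toy.rename(columns={"hipertension": "hypertension", "handcap": "handicap"})

print(list(fixed.columns))
```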
# no_show == 'No' means the patient attended; 'Yes' means the appointment was missed
show=df.no_show=='No'
not_show=df.no_show=='Yes'
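# parse the ISO timestamps into timezone-aware datetimes
df['scheduledday']=pd.to_datetime(df['scheduledday'])
df['scheduledday_date']=df['scheduledday'].dt.date
df['appointmentday']=pd.to_datetime(df['appointmentday'])
df['appointmentday_date']=df['appointmentday'].dt.date
# waiting time in whole days; normalize() zeroes the time of day while keeping a datetime dtype
df['waiting_time']=(df['appointmentday'].dt.normalize()-df['scheduledday'].dt.normalize()).dt.days
df['waiting_time']=df['waiting_time'].astype(int)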
df.head()
| | patientid | appointmentid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | scheduledday_date | appointmentday_date | waiting_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | 2016-04-29 | 2016-04-29 | 0 |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29 16:08:27+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No | 2016-04-29 | 2016-04-29 | 0 |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29 16:19:04+00:00 | 2016-04-29 00:00:00+00:00 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No | 2016-04-29 | 2016-04-29 | 0 |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29 17:29:31+00:00 | 2016-04-29 00:00:00+00:00 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No | 2016-04-29 | 2016-04-29 | 0 |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29 16:07:23+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No | 2016-04-29 | 2016-04-29 | 0 |
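Once a waiting time exists, a natural sanity check is whether any appointment appears to precede its own booking. The sketch below assumes the same column names on a two-row toy frame; the first row mirrors the preview above and the second is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "scheduledday":   ["2016-04-29T18:38:08Z", "2016-04-27T08:00:00Z"],
    "appointmentday": ["2016-04-29T00:00:00Z", "2016-05-02T00:00:00Z"],
})
for col in ["scheduledday", "appointmentday"]:
    toy[col] = pd.to_datetime(toy[col])

# normalize() sets the time of day to midnight, so the difference is in whole days.
toy["waiting_time"] = (
    toy["appointmentday"].dt.normalize() - toy["scheduledday"].dt.normalize()
).dt.days

# Rows with a negative waiting time would deserve a closer look.
suspicious = toy[toy["waiting_time"] < 0]

print(toy["waiting_time"].tolist())
print(len(suspicious))
```

Comparing calendar days rather than full timestamps matters here because every appointmentday in the preview is stamped at midnight, while the booking carries a time of day.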
df.drop(df.loc[df['age']<0].index,axis=0,inplace=True)
df.loc[df['age']<0]
| | patientid | appointmentid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | scheduledday_date | appointmentday_date | waiting_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
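The same cleanup can be written as a boolean filter that keeps the valid rows instead of dropping the invalid ones in place. A minimal sketch on a toy column (`toy` and `valid` are illustrative names):

```python
import pandas as pd

# Ages taken from values seen earlier in the notebook, including the invalid -1.
toy = pd.DataFrame({"age": [62, -1, 8, 115]})

# Keep only rows with a non-negative age and rebuild a clean 0..n-1 index.
valid = toy[toy["age"] >= 0].reset_index(drop=True)

print(valid["age"].tolist())
```

This form leaves the original frame untouched, which can make it easier to compare row counts before and after the filter.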
df['no_show']=df['no_show'].astype('category')
df['no_show']=df['no_show'].cat.codes
print(df['no_show'].dtypes)
int8
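Category codes are assigned in sorted category order, so here 'No' becomes 0 and 'Yes' becomes 1. In other words, a code of 1 marks a missed appointment. An explicit mapping gives the same result and keeps that meaning visible in the code; the sketch below compares the two on a toy series.

```python
import pandas as pd

labels = pd.Series(["No", "Yes", "No"], name="no_show")

# Approach used in the notebook: codes follow the sorted categories ('No', 'Yes').
codes_from_category = labels.astype("category").cat.codes

# Alternative: spell the encoding out. 1 means the patient did not show up.
codes_from_map = labels.map({"No": 0, "Yes": 1})

print(codes_from_category.tolist())
print(codes_from_map.tolist())
```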
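# restrict to numeric columns; the datetime and date columns added above are not meaningful here
df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(10,10))
sas.heatmap(df.corr(numeric_only=True),annot=True,ax=ax);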
# count plots of each flag column, split by the no_show code
def plotMygraph(position, dataPoint, title):
    plt.subplot(3,2,position)
    # map 0/1 to labels; any other value (handcap has levels 2 to 4) becomes NaN
    hip_mab=dataPoint.map({1:'yes',0:'no'})
    # pass x and hue by keyword: positional x is deprecated in seaborn 0.11 and rejected from 0.12
    ax=sas.countplot(x=hip_mab,hue=df['no_show'])
    plt.title(title,fontsize=15)
    # no_show codes: 0 = 'No' (patient attended), 1 = 'Yes' (appointment missed)
    ax.get_legend().set_title('no_show (0 = attended, 1 = missed)')
#hipertension
plotMygraph(1, df['hipertension'], 'Hypertension vs. attendance')
#for diabetes
plotMygraph(2, df['diabetes'], 'Diabetes vs. attendance')
#handcap
plotMygraph(3, df['handcap'], 'Handicap vs. attendance')
#alcoholism
plotMygraph(4, df['alcoholism'], 'Alcoholism vs. attendance')
#sms_received
plotMygraph(5, df['sms_received'], 'SMS received vs. attendance')
#scholarship
plotMygraph(6, df['scholarship'], 'Scholarship vs. attendance')
plt.subplots_adjust(left=0,right=1.5,bottom=0,top=2.5,wspace=0.3,hspace=0.3)
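Count plots show how many appointments fall in each group, but the groups differ a lot in size, so a rate is often easier to compare. With no_show encoded as 0/1, the mean per group is the no-show rate. A minimal sketch on made-up values (only the column names come from the notebook):

```python
import pandas as pd

# Made-up rows: four appointments without an SMS, two with one.
toy = pd.DataFrame({
    "sms_received": [0, 0, 0, 0, 1, 1],
    "no_show":      [0, 0, 0, 1, 0, 1],
})

# Mean of a 0/1 column within each group = share of missed appointments.
rate_by_sms = toy.groupby("sms_received")["no_show"].mean()

print(rate_by_sms)
```

The same pattern would apply to the other flag columns plotted above by swapping the column name passed to `groupby`.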
- Clinic location: attendance rates vary from one neighbourhood to another, which suggests that the clinic location affects patient attendance.
- Gender: females account for more missed appointments than males. Note that the number of females in the dataset is also greater than the number of males.
- Diseases: they do not clearly affect attendance rates.
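Because the groups are of different sizes, raw counts and rates can tell different stories. One way to check a conclusion like the gender one is a crosstab normalised per row. The sketch below uses made-up values in which both genders miss the same number of appointments but at different rates.

```python
import pandas as pd

# Made-up rows: four female and two male appointments, one missed in each group.
toy = pd.DataFrame({
    "gender":  ["F", "F", "F", "F", "M", "M"],
    "no_show": [1, 0, 0, 0, 1, 0],
})

# Raw counts per gender and outcome.
counts = pd.crosstab(toy["gender"], toy["no_show"])

# normalize="index" divides each row by its total, giving rates per gender.
rates = pd.crosstab(toy["gender"], toy["no_show"], normalize="index")

print(counts)
print(rates)
```

The same call with the neighbourhood column in place of gender would give per-location rates.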